LEXINGTON


MASSACHUSETTS 


ABSTRACT 


The  design  and  implementation  of  a hardware  Fermat  Number 
Transform  (FNT)  is  described.  The  arithmetic  logic  design  is 
treated  in  detail  and  a new  data  representation  for  integers 
modulo  a Fermat  number  is  derived.  Some  results  of  filter 
implementation  with  the  FNT  are  shown  to  illustrate  the  use 
of  the  hardware.  Finally,  the  FNT  is  compared  with  the  Fast 
Fourier  Transform  (FFT)  on  the  basis  of  hardware  required  for 
a pipeline  convolver. 
I. Introduction


The  use  of  number  theory  transforms  for  implementing  digital  convolution 
is  attractive  from  a theoretical  point  of  view  because  it  is  possible  to 
derive  a transform  that  requires  no  multiplications.  Since  multipliers  are  a 
major  hardware  expense  in  Fast  Fourier  Transform  (FFT)  pipeline  convolvers 
or  in  direct  form  convolvers,  the  potential  for  building  cheaper  and/or 
faster  convolvers  should  lie  with  number  theory  transforms.  Many  other 
factors  cloud  the  picture  and  the  savings  in  multiplier  hardware  can  be  offset 
by  increased  memory  hardware  and  transform  size  in  some  cases.  It  is  the  purpose 
of  this  paper  to  examine  some  of  these  hardware  issues. 

In  the  realm  of  radar  signal  processing,  the  potential  for  greater  speed 
is  worth  exploring.  For  this  reason  a small  prototype  number  theory  transform 
has  been  constructed.  This  hardware  consists  of  the  computational  element 
(butterfly) for a 64-point, 16-bit Fermat Number Transform (FNT).

In  the  process  of  designing  the  butterfly,  a new  coding  scheme  for  the 
data  was  developed  to  facilitate  the  arithmetic  operations  modulo  the  Fermat 
number.  The  experience  gained  in  designing  and  building  this  hardware  is  the 
basis  for  estimates  of  the  size  of  pipeline  FNT  convolvers  for  possible  use 
in  radar  signal  processors.  The  result  of  the  hardware  sizing  of  the  FNT 
versus  a pipeline  FFT  indicates  that  the  anticipated  savings  can  be  realized 
for  small  systems  (e.g.,  length  32  convolution)  when  the  data  to  be  filtered 
is  real.  However,  in  larger  systems  where  one  must  use  two-dimensional  convolution 
to  implement  the  one-dimensional  convolution,  the  savings  in  multiplier  hardware 
are  offset  by  the  increased  transform  size  and  the  corresponding  increase  in 
memory size and reference spectrum multiplier hardware. In this case, when the
data to be filtered are real, the FNT still offers a small decrease in the amount
of hardware versus the FFT, but when the data are complex the amount of hardware
is much greater than that of a pipeline FFT.

Finally,  some  examples  of  FIR  filter  implementations  using  the  FNT 
hardware  will  be  given,  in  order  to  illustrate  the  effects  of  precision  with 
this  approach. 
II. Theory


There  is  a large  class  of  transforms  which  can  be  derived  when  the 
underlying algebraic structure is assumed to be a finite field or ring (Ref. 1).
When  considering  hardware  implementations,  however,  only  the  Fermat  number 
transforms  offer  the  dual  advantages  of  no  multiplications  in  the  transform 
and  decomposition  into  a fast  algorithm  analogous  to  the  FFT. 

The FNT of $\{x(k)\}$ is defined as

$$X(n) = \sum_{k=0}^{N-1} x(k)\,\alpha^{\langle nk \rangle} \bmod F_t \qquad (1)$$

where

$F_t = 2^{2^t} + 1$, the $t$th Fermat number,

$N$ is a power of 2,

and $\alpha$ is an $N$th root of unity (i.e., $\alpha^N = 1 \bmod F_t$ and $\alpha^m \neq 1$, $1 \le m < N$).

The notation $\langle nk \rangle$ means $nk$ modulo $N$.

The only FNT's considered here are those in which $\alpha$ has a simple binary representation, so that the multiplications implied in equation (1) are easy to implement. It is possible to show that for $N = 2^{t+1}$, $\alpha = 2$ is an $N$th root of unity [2]. In this case the multiplications in (1) become bit shifts. Agarwal and Burrus [3] showed that $\alpha = 2^{3 \cdot 2^{t-2}} - 2^{2^{t-2}}$ is an $N$th root of unity for $N = 2^{t+2}$. Since $\alpha^2 = 2 \bmod F_t$, this $\alpha$ is usually referred to as the square root of 2 and written $\alpha = \sqrt{2}$. For general $F_t$ ($t > 4$), $N = 2^{t+2}$ is the largest power of 2 for which a transform can be defined.

Since $\sqrt{2}$ has a two-bit representation, multiplication by $\sqrt{2}$ can be implemented with one subtraction. In addition, the case of $N$ equal to a power of two is important because it allows factorization of the transform into a structure similar to the FFT algorithm for the Discrete Fourier Transform.

These properties of the FNT are explained in detail in reference 3.
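As a quick numerical check of the statements above, the following sketch evaluates equation (1) directly for the 16-bit case ($F_4 = 2^{16}+1$, $N = 64$, $\alpha = \sqrt{2}$). The function names and the brute-force $O(N^2)$ evaluation are illustrative assumptions and are not the hardware algorithm; `pow` with a negative exponent assumes Python 3.8 or later.

```python
# Sketch: direct evaluation of the FNT definition, eq. (1), for t = 4.
T = 4
F = (1 << (1 << T)) + 1                                  # F_4 = 2**16 + 1 = 65537
SQRT2 = (1 << (3 << (T - 2))) - (1 << (1 << (T - 2)))    # 2**12 - 2**4
N = 1 << (T + 2)                                         # 64

def order(a, modulus):
    """Smallest m >= 1 with a**m == 1 (mod modulus)."""
    m, p = 1, a % modulus
    while p != 1:
        p = p * a % modulus
        m += 1
    return m

def fnt(x, alpha=SQRT2, modulus=F):
    """X(n) = sum_k x(k) * alpha**<nk> mod F, evaluated term by term."""
    n_pts = len(x)
    return [sum(xk * pow(alpha, (n * k) % n_pts, modulus)
                for k, xk in enumerate(x)) % modulus
            for n in range(n_pts)]

def inverse_fnt(X, alpha=SQRT2, modulus=F):
    """Inverse transform: same sum with alpha**-1, scaled by N**-1."""
    n_inv = pow(len(X), -1, modulus)
    return [n_inv * v % modulus for v in fnt(X, pow(alpha, -1, modulus), modulus)]
```

Checking `order(2, F)` and `order(SQRT2, F)` gives the transform lengths $2^{t+1}$ and $2^{t+2}$ quoted in the text for $t = 4$.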

A fundamental constraint imposed by the transform definition is that the wordlength of the arithmetic (determined by $F_t$) is linearly related to the transform length. For the case $\alpha = \sqrt{2}$, $N = 4 \times$ wordlength. It is possible to ease this constraint by using a two-dimensional implementation of the one-dimensional convolution [4]. With $\alpha = \sqrt{2}$ an $N \times N$ two-dimensional transform can be defined with $N = 2^{t+2}$. Using this two-dimensional transform a one-dimensional circular convolution of length $N^2/2$ can be implemented. A 50% loss in convolutional efficiency is incurred in using a two-dimensional transform, because with a length $N^2$ one-dimensional FFT, a length $N^2$ one-dimensional circular convolution can be realized.

Returning to the definition in equation (1), note that the transform is defined in the algebraic system of integers modulo $F_t$. Thus, the implementation of circular convolution using (1) will result in circular convolution modulo $F_t$. In particular, if the circular convolution

$$y_n = \sum_{k=0}^{N-1} x_k\, h_{\langle n-k \rangle} \qquad (2)$$

is computed using FNT's the result will be

$$\hat{y}_n = \left[ \sum_{k=0}^{N-1} x_k\, h_{\langle n-k \rangle} \right] \bmod F_t \qquad (3)$$
It is possible to determine $y_n$ from $\hat{y}_n$ if and only if $y_n$ is known a priori to lie in the set $[P, P + F_t - 1]$. With no prior knowledge of the ranges of $\{x_k\}$ and $\{h_k\}$ a conservative estimate of the number of bits for $x_k$ and $h_k$ can be made.

Let $N = 2^n$ and assume that $|x_k| \le 2^a$ and $|h_k| \le 2^b$ for $k = 0, 1, \ldots, N-1$. Then $y_n$ will be in the set $\left[ -\frac{F_t - 1}{2}, \frac{F_t - 1}{2} \right]$ for all possible sequences $\{x_k\}$ and $\{h_k\}$ if and only if $n + a + b \le 2^t - 1$. This bound is overly conservative in most cases, but it does represent a least upper bound. Letting $t = 4$, $n + a + b \le 15$. In this case, if $n = 5$ (i.e., a length $2^5$ convolution) then $a$ and $b$ could both be 5, but 5 bits is probably not enough accuracy for convolution. A more likely case would have $a = b = 10$ so that $n + a + b = 25$, and $t = 5$ (33-bit arithmetic) would be necessary. For most typical filtering applications a 33-bit system (i.e., a modulo $2^{32} + 1$ system) would seem to be most appropriate. The problems with precision in a 17-bit system can be overcome in other ways [5].
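The dynamic-range condition can be illustrated with a small sketch that computes a circular convolution through forward and inverse FNT's modulo $F_4 = 2^{16}+1$ and then maps the residues back into $[-(F_t-1)/2, (F_t-1)/2]$. The helper names (`circular_convolve_fnt`, `fits`) are hypothetical, the transforms are evaluated by brute force, and negative modular exponents assume Python 3.8 or later.

```python
# Sketch: circular convolution via the FNT (t = 4) with signed recovery.
F = (1 << 16) + 1                 # F_4 = 65537
SQRT2 = (1 << 12) - (1 << 4)      # element of order 64 modulo F_4

def fnt(x, alpha):
    n = len(x)
    return [sum(xk * pow(alpha, (i * k) % n, F) for k, xk in enumerate(x)) % F
            for i in range(n)]

def circular_convolve_fnt(x, h):
    n = len(x)                              # assumed to be a power of 2 dividing 64
    alpha = pow(SQRT2, 64 // n, F)          # root of unity of order n
    Y = [a * b % F for a, b in zip(fnt(x, alpha), fnt(h, alpha))]
    n_inv = pow(n, -1, F)
    y_hat = [n_inv * v % F for v in fnt(Y, pow(alpha, -1, F))]
    # Residues above (F-1)/2 are interpreted as negative results.
    return [v - F if v > (F - 1) // 2 else v for v in y_hat]

def circular_convolve_direct(x, h):
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]

def fits(n_log2, a_bits, b_bits, t=4):
    """Conservative bound from the text: n + a + b <= 2**t - 1."""
    return n_log2 + a_bits + b_bits <= (1 << t) - 1
```

With $n = a = b = 5$ the bound is met with equality, so a length-32 convolution of inputs bounded by $2^5$ in magnitude is recovered exactly; with $a = b = 10$ it is violated for $t = 4$, as stated above.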

A drawback of the fact that the wordlength of the system must be a power of 2 is that it is not possible to trade off wordlength versus performance as is commonly done in the realization of digital filters. However, the computation of the convolution using the FNT (or any number theoretic transform) is exact. That is, after the quantization of the input data and the filter coefficients, no additional quantization noise is introduced in the filtering process. Thus, the need for simulations of the filtering process is reduced to determining two wordlengths as opposed to present efforts that involve all the internal precisions of the calculation [6].

A more complete discussion of all these theoretical issues can be found in references 1 through 5. In the following sections our attention will focus on the hardware issues encountered in realizing the FNT.
III. Representation of Numbers Modulo $2^t + 1$

3.1 Modulo $2^t + 1$ Arithmetic

In this section a new data coding scheme is described for performing arithmetic modulo $2^t + 1$. The main result is that modulo $2^t + 1$ arithmetic can be implemented in a manner that is similar to 1's complement arithmetic (i.e., modulo $2^t - 1$ arithmetic). A description of the arithmetic operations of the FNT when the data are encoded in 2's complement notation can be found in reference 3. The arithmetic operations of interest are addition, subtraction and multiplication by a power of 2, because these are the basic operations in the butterfly of the FNT using an FFT-like structure.

Recall that in the ring of integers mod $2^t + 1$, there are $2^t + 1$ elements. Thus, $t + 1$ bits are needed to represent all possible numbers. If a binary coding scheme such as 2's complement were used, then whenever the MSB was one, all the other bits would be zero. This combination would represent the number $2^t$. Thus, the $t$th bit is used only in this one case. A new coding scheme is proposed in which the collection of $t+1$ bits $[b_t\, b_{t-1} \ldots b_0]$ represents the number $B$ in the following way:

1. If $b_t = 1$ then $B = 0$.

2. If $b_t = 0$ then $B = -(-1)^{b_{t-1}}\, 2^{t-1} - (-1)^{b_{t-2}}\, 2^{t-2} - \cdots - (-1)^{b_0}$.

That is, the $j$th bit has weight $2^j$ and sign $\sigma_j$, where

$$\sigma_j = -(-1)^{b_j} = \begin{cases} 1 & \text{if } b_j = 1 \\ -1 & \text{if } b_j = 0 \end{cases}$$
Example 1:

Letting $t = 4$ and $2^t + 1 = 17$,

1 0 0 0 0 represents zero,

0 1 0 1 0 $= 2^3 - 2^2 + 2 - 1 = 5$,

0 0 0 1 1 $= -2^3 - 2^2 + 2 + 1 = -9 = 8 \bmod 17$,

and 1 0 1 0 1 is an illegal combination.

Ordinarily, the representation proposed would yield only odd numbers. However, the use of modulo $2^t + 1$ arithmetic means that both even and odd numbers will be represented. To see this, note that

$$\begin{aligned} B &= (2b_{t-1} - 1)\,2^{t-1} + (2b_{t-2} - 1)\,2^{t-2} + \cdots + (2b_0 - 1) \\ &= b_{t-1} 2^{t} + b_{t-2} 2^{t-1} + \cdots + 2b_0 - (2^t - 1) \\ &= (b_{t-2} 2^{t-1} + \cdots + 2b_0 - b_{t-1} + 2) \bmod (2^t + 1) \qquad (4) \end{aligned}$$

The term in brackets takes on all values from $+1$ to $2^t$ and the special case of zero was handled by $b_t = 1$, so all numbers are represented.
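The coding rule and equation (4) can be checked exhaustively for $t = 4$ with a short sketch. The function name `decode` and the integer packing of the code word (bit $t$ as the control bit, bits $t-1 \ldots 0$ as the signed-weight bits) are assumptions made for illustration.

```python
# Sketch: decode a (t+1)-bit code word into its value modulo 2**t + 1.
T = 4
M = (1 << T) + 1          # 17

def decode(code, t=T):
    """Return the residue in [0, 2**t] represented by `code`."""
    if code >> t:                         # control bit b_t = 1
        if code != 1 << t:
            raise ValueError("illegal code word")
        return 0
    # Bit j contributes +2**j when set and -2**j when clear.
    value = sum((1 if (code >> j) & 1 else -1) * (1 << j) for j in range(t))
    return value % ((1 << t) + 1)
```

For nonzero code words the result agrees with the closed form $2\hat{B} + 2 \bmod (2^t + 1)$, where $\hat{B}$ is the low $t$ bits read as an unsigned integer, and the sixteen nonzero code words cover the values 1 through 16 exactly once.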

Consider arithmetic operations using this number representation.

First of all, multiplication by a power of 2 is trivial. If the number is zero (i.e., $b_t = 1$), nothing is done. If the number is non-zero, the low order $t$ bits are circularly shifted to the left a number of places equal to the power of 2, and a bit is replaced by its complement as it enters the LSB position. This is a consequence of the fact that $2^t = -1 \bmod (2^t + 1)$.

Example 2:

Letting $t = 4$ and $2^t + 1 = 17$, 8 is represented as 0 0 0 1 1. Applying the above rule, $8 \times 2 = 16$ = 0 0 1 1 1; and 0 0 1 1 1 $= -8 + 7 = -1 = 16 \bmod 17$. Further, $8 \times 8$ = 0 1 1 1 0 $= 14 - 1 = 13 = 64 \bmod 17$.
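The rotation rule of Example 2 can be sketched as follows. The names `times_pow2` and `value` are hypothetical; `value` uses the closed form $2\hat{B} + 2 \bmod (2^t+1)$ from equation (4) to interpret a code word.

```python
# Sketch: multiplication by 2**k as a left rotation with complemented wrap-around.
T = 4
MASK = (1 << T) - 1       # low t bits
ZERO = 1 << T             # code word for zero (control bit set)

def value(code):
    """Residue modulo 2**T + 1 represented by a legal code word."""
    return 0 if code & ZERO else (2 * (code & MASK) + 2) % (ZERO + 1)

def times_pow2(code, k):
    """Multiply the coded number by 2**k modulo 2**T + 1."""
    if code & ZERO:                       # zero: rotation is inhibited
        return code
    low = code & MASK
    for _ in range(k):
        msb = (low >> (T - 1)) & 1
        # Shift left; the bit leaving the MSB re-enters the LSB complemented.
        low = ((low << 1) & MASK) | (msb ^ 1)
    return low
```

The loop performs one single-place rotation per factor of 2; a hardware rotator would select the $k$-place rotation in one step, which this sketch does not attempt to model.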

In a hardware implementation the MSB is used as a control bit. If it is one then the number is zero and the rotation is inhibited. This is characteristic of all operations using this new coding system.

Another easy operation is that of forming the negative of a number. Obviously, this is done by complementing the low order $t$ bits except in the case where $b_t = 1$. Again, the MSB is a control bit that would inhibit the operation if it is one. Since we now know how to form the negative of a number and multiply by 2, the only operation left to consider is addition.

If either or both of the operands for addition are zero (i.e., $a_t = 1$ or $b_t = 1$), then there is no addition to take place; so these special cases can be sensed and the addition inhibited. Now consider the addition of two numbers $A$ and $B$ where $A \neq 0$ and $B \neq 0$. Let

$A = a_t a_{t-1} \ldots a_0$ with $a_t = 0$,

and $B = b_t b_{t-1} \ldots b_0$ with $b_t = 0$.

Interpret the $t$ LSB's of $A$ and $B$ as the binary representation of $\hat{A}$ and $\hat{B}$, and form the sum of $\hat{A}$ and $\hat{B}$ using unsigned binary addition to obtain $\hat{C}$. That is,

$$\begin{aligned} \hat{A} &= a_{t-1} 2^{t-1} + a_{t-2} 2^{t-2} + \cdots + a_0 \\ \hat{B} &= b_{t-1} 2^{t-1} + b_{t-2} 2^{t-2} + \cdots + b_0 \\ \hat{C} &= \hat{A} + \hat{B} = c_t 2^t + c_{t-1} 2^{t-1} + \cdots + c_0 \qquad (5) \end{aligned}$$

It is possible to deduce from $\hat{C}$ the desired sum $C = (A + B) \bmod (2^t + 1)$. Since $A = 2\hat{A} + 2 \bmod (2^t + 1)$ and $B = 2\hat{B} + 2 \bmod (2^t + 1)$, $C = 2\hat{A} + 2\hat{B} + 4 \bmod (2^t + 1)$. If $C$ can be expressed as $C = 2C'' + 2 \bmod (2^t + 1)$ with $C''$ a $t$-bit number, then the $t$ bits of $C''$ are the $t$ LSB's of the code word for $C$. Let $C'$ denote the $t$ LSB's of $\hat{C}$. There are two cases, depending on the value of $c_t$. If $c_t = 1$, then $\hat{C} = 2^t + C' = C' - 1 \bmod (2^t + 1)$. Thus, $C = (2\hat{A} + 2\hat{B} + 4) \bmod (2^t + 1) = (2\hat{C} + 4) \bmod (2^t + 1) = (2C' + 2) \bmod (2^t + 1)$, and the code word for $C$ is $C'$.

If $c_t = 0$ then $C = 2C' + 4 \bmod (2^t + 1)$ and the code word for $C$ is $C' + 1$. However, this results in an extra level of add as in the case of 1's complement arithmetic. In 1's complement the output carry is added to the LSB. In this new mod $2^t + 1$ arithmetic, one takes the output carry, complements it and adds it to the LSB. Thus, the initial claim that this new arithmetic is only as complex as 1's complement is justified. There is a small amount of additional complexity due to the control bit, but this acts only as an inhibit signal.
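A minimal sketch of the addition and negation rules, checked exhaustively for $t = 4$, is given here. The function names are illustrative assumptions, and `value` again uses the closed form $2\hat{B} + 2 \bmod (2^t + 1)$ to interpret code words.

```python
# Sketch: addition and negation in the new code for t = 4 (modulo 17).
T = 4
MASK = (1 << T) - 1
ZERO = 1 << T             # code word for zero (control bit set)

def value(code):
    return 0 if code & ZERO else (2 * (code & MASK) + 2) % (ZERO + 1)

def negate(code):
    """Complement the low t bits unless the control bit inhibits it."""
    return code if code & ZERO else (~code) & MASK

def add(a, b):
    """Add two code words; a zero operand inhibits the addition."""
    if a & ZERO:
        return b
    if b & ZERO:
        return a
    s = a + b                       # unsigned t-bit add, carry lands in bit t
    carry = s >> T
    # Add the complemented carry to the LSB. If this second add overflows,
    # the result is 1 << T, which is exactly the code word for zero.
    return (s & MASK) + (carry ^ 1)
```

Subtraction follows as `add(a, negate(b))`, which is how the text builds the butterfly difference from the same adder.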

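Example 3:

Let $t = 4$ and $2^t + 1 = 17$.

      0 1 0 1 0      5
    + 0 0 0 1 1      8
        1 1 0 1      (carry out = 0)
      +       1      (complemented carry)
      0 1 1 1 0      = 14 - 1 = 13

      0 1 0 1 0      5
    + 1 0 0 0 0      0   (MSB = 1 inhibits the addition)
      0 1 0 1 0      5

      0 1 0 1 0      5
    + 0 0 1 0 1      5 - 10 = -5 = 12 mod 17
        1 1 1 1      (carry out = 0)
      +       1      (complemented carry)
      1 0 0 0 0      0 mod 17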
In this last example note that the second add automatically produced the control
bit indicator for the special case of zero. Now let's examine how one converts
from a binary coded representation of numbers to this new representation.

3.2  Code  Conversion 

The code conversion between a binary representation and this new code falls into two cases. Let $B$ be a number which is represented in both codes. Let $b_t b_{t-1} \ldots b_0$ be the binary representation of $B$ and $a_t a_{t-1} \ldots a_0$ the new representation. Also, let $\hat{B}$ be the number represented by interpreting $a_{t-1} a_{t-2} \ldots a_0$ as a binary code. The conversion rules are as follows:

1. If $B = 0$ then $a_t = 1$, $\hat{B} = 0$, and $b_k = 0$ for $k = 0, 1, \ldots, t$. This is a special case and is done separately.

2. If $B \neq 0$ then $a_t = 0$ and $B = (2\hat{B} + 2) \bmod (2^t + 1)$. Conversion from the new code to binary is implemented by forming $2\hat{B} + 2$ and comparing this sum to $2^t$. If the sum is larger than $2^t$ then $2^t + 1$ is subtracted to give the proper binary representation of $B$.

If the binary representation is given, the sum $B + 2^t - 1$ is formed. If the result is odd, $2^t + 1$ is subtracted; and finally, this result is right shifted one place. The resulting $t$ bits are the $t$ LSB's $a_{t-1} \ldots a_0$.
Example 4:

Let $t = 4$ and $2^t + 1 = 17$.

1. $B$ = 1 0 0 0 0 $= 16$ (binary)

$\hat{B} = (16 + 2^4 - 1 - 17)/2 = 7$

The new representation is $B$ = 0 0 1 1 1.

2. $B$ = 0 0 1 0 0 (new code)

$\hat{B} = 4$

$B = 2 \cdot 4 + 2 = 10$ = 0 1 0 1 0 (binary).

3. $B$ = 0 0 0 0 0 (binary) $\Leftrightarrow$ $B$ = 1 0 0 0 0 (new code).
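The two conversion procedures can be sketched directly from the rules above. The names `to_code` and `to_binary` are hypothetical, and the binary input is assumed to be a residue already reduced to the range $0 \ldots 2^t$.

```python
# Sketch: conversion between binary residues and the new code (t = 4).
T = 4

def to_code(b, t=T):
    """Binary residue b in [0, 2**t] -> (t+1)-bit code word."""
    if b == 0:
        return 1 << t                   # special case: control bit set
    s = b + (1 << t) - 1                # form B + 2**t - 1
    if s & 1:                           # odd: subtract 2**t + 1 to make it even
        s -= (1 << t) + 1
    return s >> 1                       # right shift one place

def to_binary(code, t=T):
    """(t+1)-bit code word -> binary residue in [0, 2**t]."""
    if code >> t:
        return 0
    s = 2 * code + 2                    # form 2*B_hat + 2
    if s > (1 << t):                    # larger than 2**t: subtract 2**t + 1
        s -= (1 << t) + 1
    return s
```

Running both directions over all 17 residues confirms that the two procedures are inverses of each other for $t = 4$.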

In  section  5 this  new  coding  scheme  will  be  applied  to  the  logic  design 
of  the  butterfly  of  the  FNT  algorithm.  First,  the  overall  structure  of  the 
hardware  system  will  be  described  in  Section  4. 
IV. System Description

The FNT prototype hardware is a realization of a 64-point transform in the finite field of integers modulo $2^{16} + 1$. The fast FNT algorithm implemented is a radix-2 constant geometry decimation in frequency (DIF) decomposition of the FNT. Figure 1 shows a flow diagram of this algorithm for a 16-point transform (Ref. 7). The constant geometry structure was chosen because it simplifies the memory addressing and the rotation control.

Figure  2 is  a block  diagram  of  the  complete  system  showing  the  four 
major  subsystems.  The  computational  element  (CE)  is  a radix-2  DIF  butterfly 
for  the  fast  FNT  algorithm;  the  memory  element  contains  128  seventeen-bit 
words  for  use  as  intermediate  storage  during  the  transform;  the  control 
element  is  a hardwired  implementation  of  the  fast  FNT  algorithm;  and  the  I/O 
section provides the interface with the Fast Digital Processor (FDP).

The  goal  in  building  this  hardware  was  to  construct  a CE  that  would  operate 
at  a data  rate  of  40  MHz.  In  order  to  achieve  this  speed,  ECL  10K  circuits 
were  used.  The  basic  gate  in  this  logic  family  has  a propagation  delay  of 
2 nanoseconds,  and  thus  these  circuits  are  well  suited  for  very  high  speed 
systems.  Even  with  such  high  speed  logic  circuits,  two  levels  of  reclock 
and  fast  carry  addition  were  used  in  the  CE  to  realize  a working  system  that 
runs  reliably  at  38  MHz.  In  the  remainder  of  this  section  we  will  describe 
the  major  subsystems  of  the  FNT  hardware. 

4.1  The  Computational  Element 

Fig. 1. Radix-2, 16-point, constant geometry FFT (decimation in frequency).

The basic computational element of the FNT consists of an adder, a subtractor, and a rotator for multiplication by powers of $\sqrt{2}$. Since $(\sqrt{2})^k = 2^{k_1} (\sqrt{2})^{k_0}$, where $k = 2k_1 + k_0$ and $k_0 = 0$ or 1, the rotator can be divided into two operations: a $\sqrt{2}$ multiplier (implemented as a subtractor) with a possible inhibit if $k_0 = 0$, and a $2^{k_1}$ multiplier implemented via a 16:1 multiplexer (see Figure 3). The butterfly must be reclocked twice in order to achieve the 25-nsec clock epoch. Addition (or subtraction) of two 16-bit words modulo $2^{16} + 1$ using carry look ahead addition is just fast enough to fit into the 25-nsec clock epoch. Thus, it is natural to reclock after the first stage of add and subtract and after the $\sqrt{2}$ multiplication. The multiplication by powers of 2 must be implemented using a 16:1 multiplexer arrangement in order to fit into the 25-nsec clock epoch. A shift register implementation would not be economical at such high rates.
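The two-step rotation can be checked arithmetically with a short sketch that works on ordinary residues modulo $2^{16}+1$ rather than on the coded words. The function names are illustrative assumptions; the $\sqrt{2}$ step uses the two-term form $2^{12} - 2^{4}$, i.e. two shifted copies and one subtraction.

```python
# Sketch: rotation by (sqrt 2)**k split into a sqrt(2) stage and a power-of-2 stage.
F = (1 << 16) + 1
SQRT2 = (1 << 12) - (1 << 4)          # sqrt(2) modulo F for t = 4

def shift(x, s):
    """Multiply by 2**s modulo F (a rotation of the coded word in hardware)."""
    return (x << s) % F

def times_sqrt2(x):
    """sqrt(2) * x formed from two shifted copies and one subtraction."""
    return (shift(x, 12) - shift(x, 4)) % F

def rotate(x, k):
    """Multiply x by (sqrt 2)**k for 0 <= k < 32."""
    k0, k1 = k & 1, k >> 1            # LSB gates the sqrt(2) stage
    y = times_sqrt2(x) if k0 else x   # first stage (2:1 selection)
    return shift(y, k1)               # second stage: 16 possible shifts
```

Because $k_1$ ranges over 0 to 15 for a 5-bit exponent, the second stage needs sixteen alternatives, which is consistent with the 16:1 multiplexer described above.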

4.2  The  Memory  Element 

The memory element contains 128 17-bit words. It was constructed using 17 F10405 128 x 1 ECL RAM's. The access time for these RAM's is less than 15 nsec, so they are well within the speed requirements of the system. For a 64-point transform one only needs 64 words of memory if the transform is done in place. However, in the choice of IC's for the memory element it was economical to select a 128-word chip, and so the constant geometry structure of Figure 1 was employed. In this form of the fast FNT algorithm, the transform is not done in place. Thus, the memory required is twice that of an in place algorithm. However, the memory addressing is simplified because it does not depend on which stage of the transform is being executed. Since there is only one word per memory cell, two memory accesses are required to read the two operands needed for the butterfly. Similarly, two memory write cycles are required to store the results of a butterfly, and the two data paths are multiplexed into the memory. Thus, a total of four memory accesses are needed in the course of one butterfly. In an actual pipeline structure there would be four distinct interstage delay memories serving one butterfly. Two memories would provide the input operands and two more would acquire the output operands on each clock pulse.

Fig. 3. Block diagram of the computational element of the FNT system.

4.3  The  Control  Element 

The control element is the hardware implementation of the FNT algorithm. It controls the reading and writing of memory during the transform and determines the power of $\sqrt{2}$ for rotation. The system operation is broken down into 8 states. Six of these states correspond to the six stages of the transform; the other two states are for synchronization and I/O. When the device is doing a transform the basic timing diagram of the elementary computation is shown in Figure 4. There are six time states, each 25 nsec long. During $t_0$ and $t_1$ the two operands $A$ and $B$ needed for the transform are read from memory (see Figure 3). During $t_2$, $A + B$ and $A - B$ are formed. During $t_3$, $A - B$ is multiplied by 1 or $\sqrt{2}$ and $C = A + B$ is set up to be written back to memory. During $t_4$, further rotation of $A - B$ by a power of 2 is done, $D = (A - B) \times 2^k$ is set up for its return to memory, and $C$ is written into memory. Finally, during $t_5$, $D$ is written into memory. Within each stage a counter keeps track of the fact that 32 butterflies must be executed during each stage and 64 operands must be accessed. The rotation value $\tau$ is a function of this counter and the stage counter.
Fig. 4. Timing diagram for the butterfly of the FNT system.


4.4  The  I/O  Element 


The  I/O  section  handles  all  handshaking  with  the  FDP  in  order  to 
transfer  data  back  and  forth.  Communication  is  always  two-way.  Thus,  whenever 
the  FDP  sends  a word  to  the  FNT,  the  FNT  is  also  sending  a word  to  the  FDP. 

The  memory  element  is  used  as  a buffer  memory  for  I/O  transfers.  Data  is 
transferred  in  blocks  of  64  words.  While  the  64  data  points  to  be  transformed 
are  being  sent  to  the  FNT,  the  64  output  points  from  the  previous  transform 
are  being  received  from  the  FNT.  Data  from  the  FDP  is  loaded  in  linear  sequence, 
but  it  is  read  out  in  bit  reversed  order  when  being  sent  to  the  FDP.  Since 
the  result  of  the  transform  is  in  bit  reversed  order,  this  bit  reversed 
read  out  of  memory  will  undo  the  bit  reversal  of  the  FNT.  Thus,  the  data 
in  the  FDP  is  always  in  normal  order. 

The code conversion from 2's complement to the code described in section
3 above is also performed in the I/O section before the memory is loaded. A
similar conversion back to 2's complement is also done in the I/O section.

V. Logic Design

In  this  section  the  logic  design  of  the  major  elements  of  the  FNT 
system  will  be  described.  The  description  will  concentrate  on  the  arithmetic 
section  which  implements  the  butterfly  of  the  fast  FNT  algorithm.  A basic 
objective  of  the  logic  design  was  to  construct  a butterfly  which  would  operate 
at  a data  rate  of  40  MHz.  For  this  reason  ECL  10K  logic  was  employed  in  the 
entire  system  and  the  butterfly  was  pipelined  with  two  levels  of  reclock. 

The other subsystems will also be reviewed, but the emphasis will be on
their  relation  to  the  fast  FNT  algorithm. 

5.1  The  Computational  Element 

Figure 3 shows a functional diagram of the butterfly which consists of an adder, a subtractor, a rotator, input buffer registers $R_A$ and $R_B$, reclock registers $R_W$, $R_X$, and $R_Y$ and an output register $R_Z$. Register transfers are made at each clock pulse, so that data are always flowing through the CE as would be the case in a pipelined fast FNT. As the timing diagram in Figure 4 shows, the output of the butterfly is only written into memory from $R_Z$ at $t_4$ and $t_5$. During the other clock epochs the contents of $R_Z$ may be changing, but this does not affect the algorithm.

5.1.1 Adder (Subtractor) Logic:

In Section 3 a nonstandard coding scheme for data manipulation in the FNT was derived. Recall that the rule for addition of two nonzero numbers $A = [a_{16}\, a_{15} \ldots a_0]$ and $B = [b_{16}\, b_{15} \ldots b_0]$ is:

Step 1: Add the 16 LSB's of A and B with the carry in equal to zero.

Step 2: Complement the carry out from step 1 and add it to the sum from step 1.
If either A or B is zero (i.e., $a_{16} = 1$ or $b_{16} = 1$) then the carry must be inhibited. Finally if both A and B are zero the MSB of the sum is set to one. Figure 5 shows a realization of the addition process. The structure of Figure 5 is inefficient in two respects. First of all, two 16-bit adders are required, although the second one is simple because one input is zero. Secondly, the addition is very slow because the carry must propagate through both 16-bit adders. The use of carry look ahead logic as in Figure 6 will improve both situations. Now let's look at the details of the implementation using ECL building blocks.

The 16-bit adder can be realized using four MC10181 arithmetic logic units (ALU's). Figure 7 shows a block diagram of the 4-bit ALU. In addition to producing the sum outputs, the ALU's also produce carry propagate and carry generate information for use in a carry lookahead block. Thus, the CLA block of Figure 6 is physically spread between the ALU's and a carry lookahead logic unit (MC10179). The addition process can be speeded up further if the carries into each ALU are formed in parallel from the output of the CLA logic. Then the add time will be reduced by 3 x (propagation delay time from $C_n$ to $C_{n+4}$). For the MC10181 this is approximately 10 nsec which is significant for realizing a 25-nsec clock epoch. Figure 8 shows the final realization of the 25-nsec adder. The logic expressions for the carries were derived from the fact that $C_{n+4} = P_n C_n + G_n$ for each ALU.
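The effect of forming the slice carries in parallel can be illustrated with a behavioural sketch that expands $C_{n+4} = P_n C_n + G_n$ for four 4-bit slices. This models only the carry recurrence; it is not a description of the MC10181 or MC10179 pin-level behaviour, and the function name and the XOR-based definition of group propagate are assumptions.

```python
# Sketch: 16-bit addition with group carry lookahead over four 4-bit slices.
def add16_lookahead(a, b, c0=0):
    """Return (16-bit sum, carry out) using parallel slice carries."""
    slices = [((a >> s) & 0xF, (b >> s) & 0xF) for s in (0, 4, 8, 12)]
    P = [int((x ^ y) == 0xF) for x, y in slices]   # slice passes a carry through
    G = [(x + y) >> 4 for x, y in slices]          # slice generates a carry
    # Expansion of C_{n+4} = P_n C_n + G_n, so no carry ripples slice to slice.
    c4 = G[0] | (P[0] & c0)
    c8 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c12 = (G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0])
           | (P[2] & P[1] & P[0] & c0))
    c16 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
           | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))
    total = 0
    for i, ((x, y), c) in enumerate(zip(slices, (c0, c4, c8, c12))):
        total |= ((x + y + c) & 0xF) << (4 * i)
    return total, c16
```

In the Fermat adder the carry out `c16` would then be complemented and fed back as described in Steps 1 and 2; that feedback is left out here to keep the lookahead logic isolated.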
Fig. 5. Fermat number adder (modulo $2^{16}+1$) using two 16-bit adders.

Fig. 6. Fermat number adder implemented with carry look ahead addition.

Fig. 7. Functional block diagram of the MC10181 ALU in the addition mode.
Worst case add time was calculated to be 24 nsec and was measured as 21 nsec. The flip flop setting time and setup time account for the remainder of the 25-nsec epoch.

As noted above, the subtraction A - B can be implemented by complementing B and adding it to A. The use of the MC10181 ALU allows the B input to be complemented internal to the 10181 via mode control. This results in a slight design change in the adder unless the complement is inhibited when B is zero.

The addition A + B completes the calculation on one rail of the CE. The result is held in the register $R_W$ and then is moved to $R_Z$ to be written back into memory. On the other rail of the CE, the quantity A - B is stored in the reclock register $R_X$ for subsequent rotation by a power of $\sqrt{2}$.


5.1.2 Rotation by $(\sqrt{2})^k$

The rotation by $(\sqrt{2})^k$ is split into two stages, each requiring a 25-nsec epoch. In the first stage, the quantity $X = A - B$ is multiplied by $\sqrt{2}$ if $k$ is odd. The $\sqrt{2}$ multiplier is merely a subtractor. However, since the output is zero whenever the input is zero, some simplification of the subtractor logic results, and the subtraction time is reduced. A 2:1 multiplexer at the output of the subtractor selects whether the input is to be multiplied by $\sqrt{2}$ or by 1, and is controlled by the LSB of $k$. The result of this calculation is stored in the reclock register $R_Y$. The second stage of the rotation is a multiplication by a power of 2, namely $2^{\lfloor k/2 \rfloor}$ ($\lfloor\ \rfloor$ denotes the greatest integer function).

This multiplication is implemented as a 16:1 multiplexer controlled by the upper four bits of the binary representation of the exponent $k$. Actually the 16:1 multiplexer is realized as a cascade of two 4:1 multiplexers with inverters for the end around shifts. The shifting network is followed by a 2:1 multiplexer which selects which butterfly output ($A + B$ or $2^k \cdot Y$) is to be stored in $R_Z$ and then written back into memory. This multiplexer is controlled by $t_3$ and its output is $(2^k \cdot Y) \cdot \bar{t}_3 + W \cdot t_3$. This completes the description of the CE; the other elements of the FNT will now be described.

5.2  The  Control  Element  and  Memory  Element 

The  control  logic  is  the  realization  of  the  Fermat  Number 
Transform  algorithm.  There  are  three  levels  of  control  and  each  is  driven  by 
a binary  counter.  The  highest  level  of  control  consists  of  8 states  formed 
from  the  3-bit  S counter.  Six  of  these  states  (S^,  S^...S^)  correspond  to  the 
six  transform  stages;  is  a synchronization  state;  and  is  the  I/O  state. 

The second level of control is the indexing within a stage of the FNT.
A seven-bit counter, called the K counter, is employed. Within a transform
stage or in the I/O state the K counter increments 64 times because each
data point must be referenced once. When the 64th count is reached, the S counter
is incremented. The K counter is used to form the memory address and the power
of $\sqrt{2}$ for rotation. The 7 bits are required for addressing all 128 words of
memory.

The lowest level of control is the time state counter called the T
counter. The time state counter consists of the six states $t_0, t_1, \ldots, t_5$ and
determines the sequence of operations in the butterfly.

The realization of the FNT algorithm requires the formation of memory
addresses and rotation exponents from the three control counters. Letting
$K = [K_6, K_5, \ldots, K_0]$, the rotation exponent (called twiddle control) is
$\tau = \tau_4 \tau_3 \ldots \tau_0$ where

$$
\tau = \begin{cases}
K_5\; K_4\; K_3\; K_2\; K_1 & \text{when } S = (0\;0\;0) = S_0 \\
K_5\; K_4\; K_3\; K_2\; 0 & \text{when } S = (0\;0\;1) = S_1 \\
K_5\; K_4\; K_3\; 0\; 0 & \text{when } S = (0\;1\;0) = S_2 \\
K_5\; K_4\; 0\; 0\; 0 & \text{when } S = (0\;1\;1) = S_3 \\
K_5\; 0\; 0\; 0\; 0 & \text{when } S = (1\;0\;0) = S_4 \\
0\; 0\; 0\; 0\; 0 & \text{when } S = (1\;0\;1) = S_5
\end{cases}
$$

The memory address is formed in one of four ways depending on the counters.
When memory is being written, the MAR is equal to K, and the K counter is
incremented after the write. When the memory is being read during the transform,
there are two possible memory addresses. During $t_0$, the MAR is
$[\bar{K}_6\; 0\; K_5 \ldots K_1] = P_0(K)$; during $t_1$, the MAR is $[\bar{K}_6\; 1\; K_5 \ldots K_1] = P_1(K)$.

Finally, during I/O the memory is used in bit-reversed order and the MAR
equals $[\bar{K}_6\; K_0\; K_1 \ldots K_5]$.
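The address and twiddle formation can be sketched in Python as plain bit manipulation. The function names are hypothetical. The sketch assumes that both read addresses and the I/O address use the complement of $K_6$ as the most significant bit, so that reads always come from the half of the 128-word memory that is not being written; this is an interpretation of the description, not a confirmed detail of the hardware.

```python
def _bits(value, width):
    """Bits of `value` as a list, index 0 = LSB."""
    return [(value >> i) & 1 for i in range(width)]


def _pack_msb_first(bit_list):
    """Assemble an integer from bits listed most significant first."""
    out = 0
    for b in bit_list:
        out = (out << 1) | b
    return out


def read_address(K, second_operand):
    """P0(K) or P1(K): [~K6, 0 or 1, K5, K4, K3, K2, K1]."""
    k = _bits(K, 7)
    return _pack_msb_first([1 - k[6], int(second_operand), k[5], k[4], k[3], k[2], k[1]])


def io_address(K):
    """Bit-reversed I/O read address: [~K6 (assumed), K0, K1, ..., K5]."""
    k = _bits(K, 7)
    return _pack_msb_first([1 - k[6], k[0], k[1], k[2], k[3], k[4], k[5]])


def twiddle_control(K, stage):
    """5-bit exponent tau: bits K5..K1 with the low `stage` bits forced to 0."""
    tau = (K >> 1) & 0b11111
    return (tau >> stage) << stage


if __name__ == "__main__":
    K = 0b0111110
    print(read_address(K, False), read_address(K, True), twiddle_control(K, 2))
```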


In order to complete the specification of the control of the FNT,
a register transfer sequence is provided below. This corresponds to the timing
diagram in Figure 4.

    Time state    Register transfers              Comments

    t0 · CL:      MAR = P0(K)                     Read operand A from memory
                  MDR → R_D

    t1 · CL:      MAR = P1(K)                     Read operand B from memory
                  R_D → R_A
                  MDR → R_D

    t2 · CL:      SUM(A, B) → R_W                 A + B and A - B
                  DIFF(A, B) → R_X

    t3 · CL:      R_W → R_Z
                  τ0 · MUL(√2, X) → R_Y           Multiplication

    t4 · CL:      MAR = K                         Write memory
                  K + 1 → K
                  R_Z → Mem
                  ROT(τ, Y) → R_Z                 Multiply by a power of 2

    t5 · CL:      MAR = K                         Write memory
                  K + 1 → K
                  R_Z → Mem
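As a cross-check on the control scheme, the Python sketch below models the six stages at the register-transfer level and compares the result with a direct transform. It is a behavioral model only: it uses ordinary integers modulo $2^{16} + 1$, assumes $\sqrt{2} = 4080$, assumes that reads come from the memory half selected by the complement of $K_6$, and assumes the input block is already in the lower half of memory in natural order. All function names are hypothetical.

```python
F = (1 << 16) + 1   # 65537
SQRT2 = 4080        # assumed square root of 2 modulo F


def fnt64_control_model(x):
    """Model of the six transform stages using the K counter for addressing."""
    mem = [0] * 128
    mem[0:64] = [v % F for v in x]      # input assumed loaded in natural order
    K = 64                              # K6 = 1: stage 0 writes the upper half
    for stage in range(6):
        for _ in range(32):             # 32 butterflies per stage
            j = (K >> 1) & 0x1F         # bits K5..K1
            half = 64 * (1 - (K >> 6))  # complemented K6 selects the half read
            a = mem[half + j]           # t0: MAR = P0(K)
            b = mem[half + 32 + j]      # t1: MAR = P1(K)
            w = (a + b) % F             # t2: SUM(A, B)
            d = (a - b) % F             # t2: DIFF(A, B)
            tau = (j >> stage) << stage # twiddle control for this stage
            y = (d * pow(SQRT2, tau, F)) % F   # t3 and t4: MUL and ROT
            mem[K] = w                  # t4: write A + B, then K + 1 -> K
            mem[K + 1] = y              # t5: write rotated difference, K + 1 -> K
            K = (K + 2) % 128
    half = 64 * (1 - (K >> 6))
    return mem[half:half + 64]          # transform in bit-reversed order


def bit_reverse6(i):
    """Reverse the six address bits of i."""
    return int(format(i, "06b")[::-1], 2)


def fnt64_direct(x):
    """Direct O(N^2) transform with kernel (sqrt 2)^(n*m), for comparison."""
    return [sum(x[n] * pow(SQRT2, n * m, F) for n in range(64)) % F
            for m in range(64)]


if __name__ == "__main__":
    data = [(7 * n * n + 3) % 1000 for n in range(64)]
    out = fnt64_control_model(data)
    print([out[bit_reverse6(m)] for m in range(64)] == fnt64_direct(data))
```

Reading the result with bit-reversed addresses, as the I/O element does, returns the transform in natural order.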


5.3  The  I/O  Element 

The I/O element was designed to provide asynchronous transfer
of data between the FNT system and the FDP. The I/O is enabled only when
the S counter is in state $S_7$. An input request (IR) from the FDP is first
synchronized to the FNT clock and then the following I/O sequence takes place:

1. The memory is read with the bit-reversed address, MAR = $[\bar{K}_6\; K_0 \ldots K_5]$.
The output is stored in register $R_D$. The output
of $R_D$ is code converted to 2's complement and is transmitted to the
FDP.

2. The input acknowledge line (IA) is raised approximately
50-100 nsec after the acceptance of the IR.

3. Data are written into the memory after being code converted
from 2's complement to the new code used internal to the FNT. This
memory write occurs approximately 250 nsec after the acceptance of the
IR. The K counter is incremented after the write.

4. The IA is cleared after 1.2 μsec and is held down for a
minimum of 200 nsec.

The data rate achievable with this interface to the FDP is approximately
one word every 1.5 μsec.

VI. FDP Peripheral

In this section the operation of the FNT system as an FDP peripheral
device will be described. Two topics will be discussed: the operation of the
manual controls of the FNT and the constraints to be observed by FDP programs that
use the FNT.

Figure  9 shows  the  layout  of  the  control  panel  of  the  FNT.  The  function 
of  the  switches  and  lights  is  as  follows: 

1.  S/R  Switch:  The  Stop/Run  switch  is  a two  position  switch.  When 
the  switch  is  in  the  down  position  the  FNT  is  in  the  run  mode  and  normal  (full 
speed)  operation  of  the  hardware  is  enabled.  When  the  switch  is  placed  in  the  up 
position,  the  FNT  is  stopped.  The  stop  condition  is  indicated  by  the  red  light 
above  the  S/R  switch.  In  the  stop  mode,  depressing  the  CYCLE  button  will  step 
the  machine  by  one  clock  cycle.  This  feature  was  used  for  debugging  the  hardware 
and  shouldn't  concern  the  programmer. 

2. The MEM.T switch is a two position switch that will allow the memory
element of the FNT to be tested. When this switch is in the down position,
the FNT is in transform mode. When the switch is in the up position, the device
is in the memory test mode and the butterfly is disabled. A red light above the
MEM.T switch signifies that the FNT is in memory test mode. In this mode data
transferred from the FDP to the FNT are returned to the FDP unchanged except for
a bit reversal. This bit reversal results from the I/O addressing modes described
in Section V. A diagnostic program uses this mode to check for the proper operation
of the memory.

3. The RESET button is used to initialize the FNT machine. When the
button is depressed, all counters are initialized and the system is put in the
synchronization state. For proper initialization the FNT should be stopped when
a RESET is done.

4. The CYCLE button, when pushed, will step the machine by one
clock pulse if the FNT is stopped. If the machine is in RUN mode, the CYCLE
button has no effect.

5. The I/O light signifies that the FNT is in the I/O state where
it is waiting for I/O with the FDP.

Fig. 9. Front panel of the FNT machine.

Initialization  of  the  FNT  for  use  in  the  transform  mode  requires 
the  following  steps. 

(i)  Place  MEM.T  switch  in  the  down  position.  The  red  light  above  the 
switch  should  be  off. 

(ii)  Place  the  S/R  switch  in  the  up  position.  The  red  "Stop"  light 
should  come  on. 

(iii)  Press  the  RESET  button  and  hold  in  for  1 or  2 seconds.  The  I/O 
light  should  be  off  after  this  operation. 

(iv) Place the S/R switch in the "run" position (i.e., down). The "stop"
light should go off and the I/O light should come on. Now the device is initialized
and ready to accept data from the FDP. We now turn to a discussion of the
programming features of the FNT.

The FNT is connected to the FDP via subchannel 6 of I/O channel 1. Since
data communication between the two machines is always full duplex, the I/O
handshaking is done using the control signals of the FDP input channel. Thus, the
execution of an IOC instruction which sets the input request line will cause
the FNT to take as input the contents of the FDP $E_0$ register and to send an output
word which will be strobed into an E register of the FDP. The timing of the I/O
transfers allows $E_0$ to be loaded in the same instruction as the IOC. Thus,
the following instruction is valid.

TME DATA 00 IOC 3 3 0 0 15

In order to do a transform, 64 data points must be loaded into the FNT.
The FNT machine automatically starts the calculation of the transform when
the 64th data point arrives. At the present 38-MHz clock rate, the transform takes
about 30 μsec. To obtain the results of the transform, 64 more I/O transfers must
be done. The full duplex mode of operation allows one to load data for a new
transform and obtain the results of the previous transform at the same time.
Since the I/O of 64 data points takes 2 or 3 times as long as the transform
itself, this is an important factor for obtaining maximum performance from
the machine.
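The timing figures quoted above can be checked with a rough calculation. It assumes that each of the six time states of a butterfly takes one clock period, that each of the six stages executes 32 butterflies, and that I/O proceeds at the interface rate of one word every 1.5 μsec; overhead in the synchronization state is ignored.

```latex
\begin{align*}
T_{\text{transform}} &\approx
  \frac{6\ \text{stages} \times 32\ \text{butterflies} \times 6\ \text{time states}}
       {38 \times 10^{6}\ \text{clocks/sec}}
  = \frac{1152}{38 \times 10^{6}}\ \text{sec} \approx 30.3\ \mu\text{sec} \\
T_{\text{I/O}} &\approx 64 \times 1.5\ \mu\text{sec} = 96\ \mu\text{sec}
  \approx 3.2\, T_{\text{transform}}
\end{align*}
```

Under these assumptions the I/O time is roughly three times the transform time, which is why overlapping the loading of one block with the unloading of the previous one matters.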

In order to do convolution using the FNT hardware, it is necessary to do
the reference spectrum multiplication (mod $2^{16} + 1$) in the FDP. A possible program
to do this is given below. The program will multiply the two 64-point arrays
DATA(MA) and DATA(MB) modulo ($2^{16} + 1$) and store the result in DATA(MB).

Since the FNT hardware will only compute a forward transform, a reordering
of the data in the FDP is necessary to obtain an inverse transform. The
64-point array x(k) to be inverse transformed must be flipped according
to the formula

$$ x'(k) = x(\langle 64 - k \rangle \bmod 64) $$

Then the forward FNT of x'(k) is the inverse FNT of x(k).

(PROGRAM TO DO REFERENCE MULTIPLY
(MULTIPLIES DATA(MA) BY DATA(MB)
(PUTS RESULT IN DATA(MB)

(DOES FOUR MULTIPLIES PER PASS

(FERMAT(MA) = 200001

(FERMAT(MB) = 177777

MJlUP 

AI1BI2  DATA  7 7 
MULURA  A I 3b  1 4 OAlA  7 7 

/MIF//MIF 
B 1 24  o 10 
Al3gi4  Data  7 7 
/ TPp// 1 PR 

A 1 1 B 1 2 DATA  7 7 

niB/TOR/MIB/TDR 

MuD/trI/MuE/TR1 
AI13PI2A  0 in  10 
M IB/MIF/MIB/MIF 
TPR//TPR/ 

R ? b r A 1 A 7 
TR1//TRI/ 

IAU//IAO/ 

RIM //RMI/ 

R A b DA  I A 7 
/MRF//MRF 

K4B  DATA  7 
K?B  DA i A 7 
A 1 1 B 1 2 DATA  7 7 
SOU  0 2 

F X 2 


F X4 


END 


FOR  It,  BIT  FNT 


y i y fermat  io 
yly  63.  7 
ypx  -l  7 
YPX  -1  7 
/TIQ//T10 
/mjL//MUl 
/TIO//TIO 
YPx  -1  7 
YPX  3 7 

TIQ/TRI/TIQ/TRI 

/IAQ//IAQ 

/RMI//RMI 

JNR  FX2  A 

TIO//TIO/ 

UNR  FX4  1 
TDR//TDR/ 
TRT//TRI/ 

YPX  -1  7 

JNR  FX4  2 
YPX  -1  7 
JNR  FX2  A 
YPX  -1  7 
JPX  MuLuPO  7 
YPX  -1  7 
XJP  -1  RTRN 

XJP  0 1 
/IPP// 

XJP  0 1 
///IPR 
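For readers without access to an FDP, the reference multiply and the index flip used for the inverse transform can be sketched in Python. This is not a translation of the FDP program; it is a minimal model on ordinary integers modulo $2^{16} + 1$. It assumes $\sqrt{2} = 4080$ as the transform kernel, uses a direct $O(N^2)$ transform in place of the hardware, and applies the scale factor $1/64$ explicitly in `inverse_fnt` (an assumption; the factor could equally be absorbed into the reference spectrum). All function names are hypothetical.

```python
F = (1 << 16) + 1   # 65537
SQRT2 = 4080        # assumed square root of 2 modulo F, a root of unity of order 64
N = 64


def fnt(x):
    """Direct forward transform: X(m) = sum_n x(n) (sqrt 2)^(n m) mod F."""
    return [sum(x[n] * pow(SQRT2, n * m, F) for n in range(N)) % F
            for m in range(N)]


def reference_multiply(a, b):
    """Point-by-point product of two spectra modulo F."""
    return [(p * q) % F for p, q in zip(a, b)]


def flip(x):
    """Reorder x so that a forward transform acts as an inverse transform."""
    return [x[(N - k) % N] for k in range(N)]


def inverse_fnt(X):
    """Inverse transform via flip + forward transform + scaling by 1/64."""
    n_inv = pow(N, -1, F)   # modular inverse of 64 (Python 3.8+)
    return [(v * n_inv) % F for v in fnt(flip(X))]


def cyclic_convolve(h, x):
    """Length-64 cyclic convolution modulo F."""
    return inverse_fnt(reference_multiply(fnt(h), fnt(x)))


if __name__ == "__main__":
    h = [1, 2, 3] + [0] * 61
    x = [4, 5, 6, 7] + [0] * 60
    print(cyclic_convolve(h, x)[:6])
```

The result is exact as long as the true convolution values stay below $2^{16} + 1$, which is the overflow constraint discussed with the filter examples.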
VII. Diagnostic Programs


In  order  to  debug  and  maintain  the  FNT  hardware,  three  diagnostic 
programs  are  available: 

1.  Memory  Test 

2.  Rotator  Test 

3.  Adder/Subtractor  Test 

Each  diagnostic  program  tests  a separate  functional  element  of  the  hardware, 
and  the  tests  form  a hierarchy  in  that  each  diagnostic  test  relies  on  the  proper 
operation  of  the  functional  element  tested  by  the  previous  diagnostics. 

The memory test is used with the hardware in the memory test mode. The
purpose of the test is to check the memory element of the FNT and the data
communication link between the FDP and the FNT. Data words are generated in the FDP
and sent to the FNT in blocks of 64. Then as a block of 64 words is being sent,
the words being received from the FNT are compared to the previous block sent
and checked for errors. If an error occurs, the FDP halts with information
concerning the error in the AE lights. Successful completion of the memory
test allows one to test the FNT in its transform mode with the remaining
two diagnostic tests.

Both  of  the  transform  tests  use  known  transform  pairs  to  test  different 
parts  of  the  FNT  butterfly.  In  the  rotator  test  the  rotation  element  of  the  CE  is 
subject  to  test.  The  following  transform  pair  is  used. 

$$ \mathrm{FNT}\{e^{(r)}\} = X^{(r)} = [\,1,\ (\sqrt{2})^{r},\ \ldots,\ (\sqrt{2})^{r(N-1)}\,] $$

where $e^{(r)} = [\,0 \ldots 1 \ldots 0\,]$, with the 1 in the r-th position.

Since the transform of $e^{(r)}$ involves no additions or subtractions (except
with zero), this test serves to check out the rotation element of the butterfly.
As before, the program halts if an error is detected and displays the
erroneous data in the AE lights.
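The rotator test can be sketched in Python as follows. The sketch is not the FDP diagnostic; it only illustrates the transform pair. It assumes $\sqrt{2} = 4080$ modulo $2^{16} + 1$, counts positions from zero, and stands in for the hardware with a direct transform. The names `unit_vector`, `expected_rotator_output` and `rotator_test` are hypothetical.

```python
F = (1 << 16) + 1   # 65537
SQRT2 = 4080        # assumed square root of 2 modulo F
N = 64


def fnt(x):
    """Direct forward transform used here as a stand-in for the hardware."""
    return [sum(x[n] * pow(SQRT2, n * m, F) for n in range(N)) % F
            for m in range(N)]


def unit_vector(r):
    """e^(r): a single 1 in position r (counted from zero), zeros elsewhere."""
    return [1 if n == r else 0 for n in range(N)]


def expected_rotator_output(r):
    """[1, (sqrt 2)^r, ..., (sqrt 2)^(r (N-1))] modulo F."""
    return [pow(SQRT2, r * m, F) for m in range(N)]


def rotator_test(transform):
    """Return the list of positions r for which the transform pair fails."""
    return [r for r in range(N)
            if transform(unit_vector(r)) != expected_rotator_output(r)]


if __name__ == "__main__":
    print(rotator_test(fnt))   # an empty list means every position passed
```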

In order to check the adder/subtractor pair in the butterfly, another
known transform pair is used.

$$ \mathrm{FNT}\{[\,a,\ 0,\ \ldots,\ 0,\ b,\ 0,\ \ldots,\ 0\,]\} = [\,a+b,\ a-b,\ a+b,\ \ldots,\ a-b\,] $$

where b is in the 32nd position.

The  first  two  elements  of  the  transform  are  examined  to  check  the  addition  and 
subtraction.  The  numbers  a and  b are  varied  to  check  all  possible  cases. 
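A Python sketch of the adder/subtractor test follows. It assumes positions are counted from zero, so that b sits at index 32 and is weighted by $(\sqrt{2})^{32m} = (-1)^m$, and it uses a direct transform with $\sqrt{2} = 4080$ in place of the hardware. The function names and the sample operand pairs are hypothetical.

```python
F = (1 << 16) + 1   # 65537
SQRT2 = 4080        # assumed square root of 2 modulo F
N = 64


def fnt(x):
    """Direct forward transform used here as a stand-in for the hardware."""
    return [sum(x[n] * pow(SQRT2, n * m, F) for n in range(N)) % F
            for m in range(N)]


def adder_test_vector(a, b):
    """Input with a in position 0 and b in position 32, zeros elsewhere."""
    x = [0] * N
    x[0] = a % F
    x[32] = b % F
    return x


def adder_test(transform, pairs):
    """Return the (a, b) pairs whose first two outputs are not a+b and a-b."""
    failures = []
    for a, b in pairs:
        out = transform(adder_test_vector(a, b))
        if out[0] != (a + b) % F or out[1] != (a - b) % F:
            failures.append((a, b))
    return failures


if __name__ == "__main__":
    cases = [(0, 0), (1, 1), (65536, 1), (12345, 54321)]
    print(adder_test(fnt, cases))   # an empty list means every pair passed
```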

The  successful  completion  of  all  three  diagnostic  tests  should  guarantee  that 
the  FNT  hardware  is  operating  properly. 
VIII. Hardware Comparison of the FNT vs the FFT


Since  one  purpose  in  building  a hardware  prototype  of  the  FNT  was  to 
gain  a working  knowledge  of  the  amount  of  hardware  needed  to  implement  an 
FNT  convolver,  it  is  appropriate  to  compare  this  method  with  the  standard  FFT 
implementation  of  convolution.  It  is  impossible  to  make  a comparison  that  will 
apply  in  all  cases,  or  even  a majority  of  cases.  Therefore,  a specific  application 
has  been  chosen  for  comparison;  namely,  digital  filter  implementation  for  radar 
signal  processing.  The  problem  areas  to  be  described  below  should  be  representative 
of  the  general  problems  associated  with  FNT  convolution  implementation. 

The signal bandwidths encountered in radar signal processing (10-30 MHz)
require a pipeline architecture for either the FNT or the FFT [8]. Furthermore,
the length of the convolution to be implemented is assumed to be large
(e.g., 512 or greater). Two cases will be considered: a length 1024 linear
convolution of real data and a length 1024 convolution of complex data. Four
measures of hardware complexity are the basis of comparison: the number of
butterflies per output point, the number of reference spectrum multiplies
per output point, the total amount of interstage delay line memory in the
forward and inverse transforms, and the total amount of reference spectrum
memory. The FFT implementation will be considered first.

For either real or complex data, it is assumed that the FFT implementation
employs an 11-stage radix-2 pipeline FFT in both the forward and inverse
transforms. (Note: for real data it is possible to do a length N transform
with two length N/2 transforms and some overhead to combine the two shorter
transforms [9]. However, the overhead amounts to an additional butterfly so
there is little, if anything, to gain using this fact in a pipeline FFT.)

Much work has been done on the hardware complexity of pipeline FFT's and
the four measures of complexity we are considering here are detailed
in ref. 10. For the case of a length 2048 pipeline FFT convolver the number
of butterflies per output point is $2 \log_2 N = 22$, assuming 50% convolutional
overlap. Likewise, two reference spectrum multiplies must be done per output
point. The amount of interstage delay line memory can be calculated from
the formula

$$ \mathrm{IDM} = \frac{(r + 1)}{r}\, N \qquad (6) $$

where r is the radix of the transform. Thus, for two radix-2 pipelines,
the total is 3N = 6144 = 6K words of memory. Finally, the reference memory
requires 2K words of memory. Now we turn our attention to the FNT.

A pipeline FNT structure is identical to the pipeline FFT except in the
butterfly where rotation by $e^{j 2\pi k/N}$ (in the FFT case) is replaced by multiplication
by $(\sqrt{2})^{k}$. Thus, many of the results quoted above are applicable to the FNT. Since
the FNT naturally processes real input data, the cases of real and complex
convolution require different realizations. In both cases, however, a
two-dimensional implementation of the convolution is required [4]. The two
arrays to be convolved are H and X, where

$$
X = \begin{bmatrix}
x(0) & x(L) & \cdots & x(N-L) \\
x(1) & x(L+1) & \cdots & x(N-L+1) \\
\vdots & \vdots & & \vdots \\
x(L-1) & x(2L-1) & \cdots & x(N-1) \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}
\qquad (7a)
$$

(the array is 64 x 64, with L = 32 rows of data followed by rows of zeros) and

$$
H = \begin{bmatrix}
h(N-L+1) & h(1) & \cdots & h(N-2L+1) \\
\vdots & \vdots & & \vdots \\
h(N-1) & h(L-1) & \cdots & \\
h(0) & h(L) & \cdots & \\
h(1) & & \cdots & \\
\vdots & & & \vdots \\
h(L-1) & & \cdots &
\end{bmatrix}
\qquad (7b)
$$
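The layout of the X array can be illustrated with a short Python sketch. It assumes N = 2048 input samples arranged column by column into L = 32 data rows and 64 columns, then padded with zero rows to 64 x 64, which matches the dimensions annotated on equation 7a. The function name `build_x_array` is hypothetical, and the H array is not modeled here.

```python
def build_x_array(x, L=32, rows=64):
    """Arrange x(0..N-1) so that element (i, j) is x(j*L + i) for i < L.

    Rows L..rows-1 are left as zeros, giving the zero-padded array of
    equation 7a.
    """
    if len(x) % L != 0:
        raise ValueError("length of x must be a multiple of L")
    cols = len(x) // L
    X = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        for i in range(L):
            X[i][j] = x[j * L + i]
    return X


if __name__ == "__main__":
    X = build_x_array(list(range(2048)))
    # First row holds x(0), x(L), ..., x(N-L); row L-1 ends with x(N-1).
    print(X[0][:3], X[31][-1], sum(X[32]))
```

Because the lower 32 rows are identically zero, only the 32 data rows need a row transform, which is the saving exploited in the butterfly count that follows.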


The length 1024 convolution of real signals can be implemented with a 64 x 64
transform. The input data is the array X of equation 7 and it is advantageous
to exploit the fact that half of the X array is zero by doing the row transforms
first. Hence, 96 length 64 FNT's must be computed for the entire 2-D transform.
The total number of butterflies for the complete convolution is
$2 \times 96 \times 32 \log_2 64 = 9 \times 2^{12}$, or 36 butterflies per output point. This is reflected in the structure
of Figure 10 where there are 36 butterflies working in parallel, six in
each 64-point FNT. Since $2^{12}$ reference function multiplies must be done
during each convolution, four reference multiplies per output point are required.
The interstage delay memory requirement is calculated from the individual 64-point
transforms. The first and last 64-point FNT's are computing 32 transforms
simultaneously. This is accomplished by making all interstage delay
lines 32 times as long as in a standard pipeline and by modifying the control
to switch at 1/32nd the speed of a normal pipeline. That is, the rotation
exponents and commutator switches are only changed at every 32nd clock pulse.
The other four FNT's employ a normal pipeline structure. Applying equation 6,
the total interstage delay memory is $3 \cdot 64 \cdot 32 + 2 \cdot 3 \cdot 64 = 6\mathrm{K} + 384 \approx 6.4\mathrm{K}$ words.


Fig. 10. Pipeline realization of a two-dimensional (64 x 64) FNT convolution.


Finally,  it  is  easy  to  verify  that  the  amount  of  reference  function  memory 
is  4K. 
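The four hardware measures for the real-data case can be recomputed with a short Python sketch. It assumes 1024 output points per block, a per-pipeline delay memory of 3N/2 words for a radix-2 pipeline (the value used in the text for r = 2), and the FNT structure of six 64-point pipelines in which the first and last have delay lines 32 times longer. The function and key names are hypothetical.

```python
def fft_measures(n=2048, outputs=1024):
    """Measures for forward + inverse radix-2 pipeline FFTs of length n."""
    stages = n.bit_length() - 1                  # log2(n) for a power of two
    butterflies = 2 * (n // 2) * stages          # forward and inverse transforms
    return {
        "butterflies_per_output": butterflies // outputs,
        "reference_multiplies_per_output": n // outputs,
        "interstage_delay_words": 2 * (3 * n // 2),
        "reference_memory_words": n,
    }


def fnt_measures(size=64, data_rows=32, outputs=1024):
    """Measures for the two-dimensional (size x size) FNT convolver."""
    stages = size.bit_length() - 1
    transforms = data_rows + size                # 32 row + 64 column transforms
    butterflies = 2 * transforms * (size // 2) * stages
    per_pipeline = 3 * size // 2                 # delay words in one 64-point pipeline
    delay = 2 * per_pipeline * data_rows + 4 * per_pipeline
    return {
        "butterflies_per_output": butterflies // outputs,
        "reference_multiplies_per_output": (size * size) // outputs,
        "interstage_delay_words": delay,
        "reference_memory_words": size * size,
    }


if __name__ == "__main__":
    print(fft_measures())
    print(fnt_measures())
```

The delay figure of 6528 words for the FNT is the 6K + 384 quoted in the text, rounded there to 6.4K.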

If the signals to be convolved are complex then one possible implementation
is to handle the real and imaginary parts of the data separately. The number
of butterflies per output point, the interstage delay memory, and the reference
spectrum memory are all doubled. However, the number of real multipliers for
the reference function multiply is quadrupled because the multiplication is
complex. Table 1 summarizes the three systems versus the four hardware measures.
Notice that the FNT always requires more memory and more computational elements
than the FFT. Hardware savings are nevertheless possible because most of the hardware
cost of the FFT is concentrated in the butterfly elements (up to 80%) and
because the FNT butterfly requires from one-third to one-sixth the hardware
of an FFT butterfly. These remarks apply to the FNT when the data to be convolved
are real, but when the data are complex the situation becomes much more difficult
because all measures of hardware complexity are increased by a factor of three
or four. Therefore, we will concentrate on the case of real convolution in the
following discussion.

Table 1. Hardware measures for FFT and FNT implementations of length 1024
convolution.

                               FFT, Real or          FNT, Real            FNT, Complex
                               Complex Data          Data                 Data

    Butterflies                22                    36                   72
    Reference Multiplies       2                     4                    16
    Interstage Delay Memory    6K complex words*     6.4K real words**    12.8K real words
    Reference Spectrum Memory  2K complex words      4K real words        8K real words

    *  One complex word will contain approximately 25 bits for typical
       high-precision radar applications.
    ** One real word contains 33 bits for FNT.

In order to be more specific, let us divide the hardware cost of any
convolver into two parts, the percentage of hardware for the butterfly elements
and the percentage of hardware for the rest of the system. Assume the following
relations between the FFT and the FNT

$$ B_{FNT} = \lambda B_{FFT} \qquad (8) $$

$$ B_{FFT} = \gamma T_{FFT} \qquad (9) $$

$$ (T_{FNT} - B_{FNT}) = \nu\, (T_{FFT} - B_{FFT}) \qquad (10) $$

where B denotes the butterfly cost and T the total convolver cost. Letting
$T_{FFT} = 1$, the total hardware for the FNT is

$$ T_{FNT} = \gamma(\lambda - \nu) + \nu \qquad (11) $$

and the savings can be expressed as

$$ 1 - T_{FNT} = (1 - \nu) - \gamma(\lambda - \nu) \qquad (12) $$
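The cost model can be explored numerically with a small Python sketch. It assumes the relations $B_{FNT} = \lambda B_{FFT}$, $B_{FFT} = \gamma T_{FFT}$ and $(T_{FNT} - B_{FNT}) = \nu (T_{FFT} - B_{FFT})$ with $T_{FFT} = 1$; the function names are hypothetical, and the sample values of $\lambda$, $\gamma$ and $\nu$ are illustrative rather than measured.

```python
def fnt_total_cost(lam, gamma, nu):
    """Total FNT convolver cost relative to an FFT convolver cost of 1."""
    butterfly_part = lam * gamma        # FNT butterfly hardware
    other_part = nu * (1.0 - gamma)     # FNT non-butterfly hardware
    return butterfly_part + other_part


def savings(lam, gamma, nu):
    """Fraction of FFT hardware saved by the FNT (negative means a loss)."""
    return 1.0 - fnt_total_cost(lam, gamma, nu)


def break_even_gamma(lam, nu):
    """Butterfly fraction gamma at which the savings cross zero."""
    return (nu - 1.0) / (nu - lam)


if __name__ == "__main__":
    # Illustrative values: FNT butterflies at one-third the FFT cost, nu = 2,
    # and 80% of the FFT convolver cost in the butterflies.
    print(savings(1.0 / 3.0, 0.8, 2.0))
    print(break_even_gamma(1.0 / 3.0, 2.0))
```

With these illustrative values the FNT saves about one-third of the hardware, and the savings vanish when the butterflies account for 60% or less of the FFT convolver cost.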


Thus, the problem is now that of determining realistic values for the three
ratios $\lambda$, $\gamma$ and $\nu$. If we assume that the accuracy requirements of the system
are high, then the FFT implementation should use a hybrid floating point scheme
as described in reference 6. This study described a complex data format using
27 bits: 11 bits each for the real and imaginary parts of the mantissa and
5 bits for a common exponent. In contrast the FNT would employ modulo $2^{32} + 1$
arithmetic, implying a 33-bit data word. From this information we would like
to argue that a reasonable value for $\nu$ is two. Three components must be
considered. First, the numbers of reference multiplies being compared are two
11 x 11 complex multipliers and four 32 x 32 real multipliers. Since the hardware
complexity of an array multiplier (which would be required at radar speeds)
is proportional to $n^2$, the four real multipliers amount to about
$\frac{4 \cdot 32^2}{2 \cdot 4 \cdot 11^2} \approx 4.2$
times the hardware of the two complex multipliers. Secondly, the interstage
delay memory is 6K words of 27 bits each for the FFT versus 6.4K words
of 33 bits each for the FNT, or a factor of 1.3 in favor of the FFT. Similarly,
the reference function memory differs by a factor of 2.44 in favor of the FFT.
Depending on the detailed logic realization, the value of $\nu$ will vary, but
$\nu = 2$ conveys the fact that the two-dimensional convolution using the FNT wastes
a factor of two in non-butterfly hardware. This is a fundamental limitation
of the FNT for long convolutions. At this point we can isolate the impact
of the butterfly hardware on the possible savings for the FNT. Figure 11 shows
a plot of hardware savings versus the percentage of butterfly hardware in the
FFT ($\gamma$). Several curves are shown with $\lambda$ as a parameter. $\lambda$ is the ratio of
FNT butterfly hardware to FFT butterfly hardware. The range of values for $\lambda$
is typical for the high speed implementations required in radar signal processing.
The value of $\nu$ was assumed to be two; if it were larger, the horizontal
intercept would be moved to the right to a higher value of $\gamma$.
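The three component ratios behind the choice of $\nu$ can be recomputed with a few lines of Python. The sketch assumes that one complex multiplier is built from four real multipliers and that array-multiplier hardware grows as the square of the word length; the function names are hypothetical.

```python
def multiplier_ratio(fnt_count=4, fnt_bits=32, fft_complex_count=2, fft_bits=11):
    """FNT reference-multiplier hardware relative to the FFT, using n^2 scaling."""
    fnt_hw = fnt_count * fnt_bits ** 2
    fft_hw = fft_complex_count * 4 * fft_bits ** 2   # 4 real multipliers per complex one (assumed)
    return fnt_hw / fft_hw


def memory_ratio(fnt_kwords, fft_kwords, fnt_bits=33, fft_bits=27):
    """FNT memory in bits relative to FFT memory in bits."""
    return (fnt_kwords * fnt_bits) / (fft_kwords * fft_bits)


if __name__ == "__main__":
    print(round(multiplier_ratio(), 2))          # reference multipliers
    print(round(memory_ratio(6.4, 6.0), 2))      # interstage delay memory
    print(round(memory_ratio(4.0, 2.0), 2))      # reference function memory
```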

Independent of the exact values of $\lambda$, $\gamma$ and $\nu$, Figure 11 clearly shows
that the FNT will provide hardware savings over the FFT only when the FFT hardware
is dominated by the CE cost* as in the case of a pipeline implementation.
Furthermore, the signals to be convolved must be real, because complex data
essentially require two real FNT convolvers. Although short convolutions
(e.g., length 64) have not been discussed here, it is worth mentioning that there
is a good chance for significant hardware savings over the FFT because a
one-dimensional FNT can be used. This means that the non-butterfly hardware of the
two systems will be approximately the same (i.e., $\nu \approx 1$) and the savings will
begin at a value of $\gamma$ near zero.

* Note: As transform size increases the memory cost of a pipeline increases faster
than the CE cost and will become a significant fraction of the total cost for
large transforms (e.g., length 16K or 32K).

Fig. 11. Region of potential hardware savings of the FNT over the FFT for a
length 1024 aperiodic convolution of real signals. (Horizontal axis: ratio of
FFT CE cost to total FFT convolver cost, in percent.)

IX. Examples of Filter Implementation


In  order  to  demonstrate  the  capabilities  of  the  FNT  hardware,  two  filters 
were  implemented  and  the  results  are  shown  below.  The  first  filter  chosen 
was  a length  33  FIR  lowpass  filter  with  a passband  cutoff  frequency  of  0.10, 
stopband  cutoff  frequency  of  0.15  and  stopband  attenuation  equal  to  -33  db. 

The  second  filter  was  a length  33  bandpass  filter  with  -44.5  db  attenuation  in 
the  stopbands.  The  cutoff  frequencies  of  the  bandpass  filter  were  0.15  for 
the  lower  stopband,  0.27  for  the  upper  stopband  and  0.20  to  0.22  for  the  pass- 
band.  Figures  12a  and  12b  show  the  ideal  frequency  responses  of  the  two 
filters  obtained  using  an  FFT  of  the  impulse  responses.  The  frequency  response 
is  shown  from  F = 0 to  F = 0.5  (assuming  the  sampling  frequency  is  unity)  and 
the  horizontal  lines  denote  steps  of  20  db. 

The method chosen to show the FNT implementation of convolution was to
filter a discrete time linear FM signal with the lowpass and bandpass filters.
The output of the filters traces out an approximate frequency response of the
filter, because the input linear FM sweeps across the frequency range of
interest. Figures 13a and 13b show the results of the convolution when the
input signal is represented with 7 bits. For the lowpass filter it is possible
to use 8 bits for the filter coefficients without overflowing the computation
but the bandpass filter can use only 7 bits. The response of the lowpass filter
is a good approximation to the ideal response of Figure 12a and much of the
roughness can be attributed to the fact that the linear FM only traces out an
approximate spectrum. The bandpass filter (Figure 13b) is much worse because
the specifications of the filter are more stringent and one less bit is available
for filter coefficients. The conclusion one reaches is that the FNT based on
$F_4 = 2^{16} + 1$ is marginal for the implementation of most filters.

It is possible to extend the precision of the FNT implementation in several
ways. One possibility is a method based on the Chinese Remainder Theorem
[5]. In this method, two convolutions are computed modulo two relatively
prime numbers and the results are combined term by term using the Chinese
Remainder Theorem. In particular, $\{h_n\}$ and $\{x_n\}$ are convolved modulo $(2^{16} + 1)$
to obtain $\{y_n\}$ and modulo $2^5$ to obtain $\{\hat{y}_n\}$. Assuming the true output $\{Y_n\}$
satisfies $Y_n < 2^5 (2^{16} + 1)$, then $Y_n$ can be written

$$ Y_n = y_n + f \cdot (2^{16} + 1) $$

Then assuming $f < 2^5$, we have

$$ f = (\hat{y}_n - y_n) \bmod 2^5 $$

Note that the convolution of $\{h_n\}$ and $\{x_n\}$ modulo $2^5$ can be done with the FNT
hardware by convolving $\{h_n \bmod 2^5\}$ with $\{x_n \bmod 2^5\}$ and taking the result
modulo $2^5$.
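The term-by-term combination can be sketched in Python. The sketch assumes non-negative data, residues $y_n$ taken in the range 0 to $2^{16}$, and a true output below $2^5 (2^{16} + 1)$. The convolutions are computed directly rather than with the FNT hardware, and the function names are hypothetical.

```python
F = (1 << 16) + 1    # first modulus, 2^16 + 1
M2 = 1 << 5          # second modulus, 2^5 (relatively prime to F)


def convolve_mod(h, x, modulus):
    """Direct linear convolution with every term reduced modulo `modulus`."""
    out = [0] * (len(h) + len(x) - 1)
    for i, hv in enumerate(h):
        for j, xv in enumerate(x):
            out[i + j] = (out[i + j] + hv * xv) % modulus
    return out


def extend_precision(y, y_hat):
    """Recover Y_n = y_n + f (2^16 + 1) with f = (y_hat_n - y_n) mod 2^5."""
    result = []
    for yn, yhn in zip(y, y_hat):
        f = (yhn - yn) % M2          # valid because (2^16 + 1) mod 2^5 == 1
        result.append(yn + f * F)
    return result


if __name__ == "__main__":
    h = [300, 1200, 2000, 1200, 300]     # illustrative 11-bit coefficients
    x = [100, 127, 90, 60]               # illustrative 7-bit data
    y = convolve_mod(h, x, F)
    y_hat = convolve_mod([v % M2 for v in h], [v % M2 for v in x], M2)
    print(extend_precision(y, y_hat))
```

The step $f = (\hat{y}_n - y_n) \bmod 2^5$ works because $2^{16} + 1$ leaves a remainder of 1 when divided by $2^5$, so $Y_n \bmod 2^5 = (y_n + f) \bmod 2^5$.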

Thus, by applying the Chinese Remainder Theorem, 5 extra bits of precision
are available for the filter implementation. Figures 14a and 14b show the
response to linear FM of the lowpass and bandpass filters with 11-bit filter
coefficients. In both cases, the response is very near the ideal of Figure 12
with the slight differences due to the linear FM input signal.

Fig. 12. Ideal filter responses.
a) lowpass filter
b) bandpass filter

Fig. 13. Filter response with linear FM input using FNT convolution.
a) lowpass filter
b) bandpass filter

Fig. 14. Filter response with linear FM input using the FNT and the Chinese
Remainder Theorem to increase precision.
a) lowpass filter
b) bandpass filter

X. Summary

A hardware implementation of the Fermat Number Transform has been
built. In the course of designing this machine a new representation for
numbers modulo $2^{16} + 1$ was derived to facilitate the arithmetic operations
of the FNT butterfly. The design goal of a 40-MHz clock rate through the
butterfly was nearly achieved with the final reliable clock rate being
38 MHz.

The FNT system was built as a peripheral device for the FDP and
the software necessary to use the FNT for convolution has been developed and
was explained here.

Finally, a comparison with the FFT for special purpose pipeline
hardware convolution has been made based on the present design. A
conclusion of this comparison is that the FNT is a useful alternative if the
data to be filtered are real and the computational elements are a large part
of the convolver cost as in a pipeline architecture.